========================================================
The AT&T Data for Diplomas initiative is a hackathon posted on Devpost. The goal of the hackathon was to identify factors that contribute to high school graduation rates, in an effort to increase graduation rates to 90% by 2020. I am performing an exploratory data analysis on the data provided for that hackathon. It is a combined dataset including federal high school graduation data and demographics, American Community Survey (ACS) data, and 2010 US Census data. Although I didn’t enter the hackathon, I was brought up in a rural area with a low graduation rate, which makes me very interested in actionable ways to increase the success of American high school students.
This is a large dataset, containing 9907 observations of 580 variables. The data schema for the entire dataset is included as part of the files for this project. The dataset combines the graduation data for each school district with the demographic data for the census tracts that overlap geospatially with the school district. For this data analysis, I want to focus on several areas that I believe may affect high school graduation rates:
Ethnicities of the high school cohort
Population of disabled students in the high school cohort
Geographic variables (state, county)
Primary language spoken in the home, if English is not spoken “very well”
Population density
Poverty level
Employment/Unemployment rate
Education level of those over age 25
Age distribution of the population
Ethnicities of the population
Looking at the data schema, I notice that there is both population data and percentage data. For example, the Hispanic population data has a variable field for both the percent of the population and the number of people per census tract. I will focus my efforts on understanding the percentage variables for the fields of interest listed above.
I want to trim the data in order to exclude variables that have a high number of NA values. I realize that I could impute these values from the other data provided, but since I have such a large number of variables to choose from, I will only consider those with a small number of NA values. The summary() function reports a count of NA values for each variable. I can use this to weed out previously identified variables of interest that have incomplete data.
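The NA screening described above can also be done programmatically rather than by reading summary() output. The following is a minimal sketch on a toy data frame; the column names and the 25% cutoff are assumptions made for illustration, not the values used in this analysis.

```r
# Toy stand-in for the real data; these column names are hypothetical.
districts <- data.frame(
  pct_a = c(1.5, NA, 3.2, 4.1),
  pct_b = c(NA, NA, NA, 2.0),
  pct_c = c(10, 20, 30, 40)
)

# Count missing values per column.
na_counts <- colSums(is.na(districts))

# Keep only columns whose share of missing values is at or below a cutoff.
# The 25% cutoff is an arbitrary choice for this example.
max_na_share <- 0.25
keep <- na_counts / nrow(districts) <= max_na_share
trimmed <- districts[, keep, drop = FALSE]

names(trimmed)
```

Here pct_b is dropped because three of its four values are missing, while pct_a and pct_c are kept.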
After looking at the data schema and structure of the data, I’ve reduced the dataset I will be considering to 75 variables (prior to further wrangling) with 9907 observations.
However, 40 of those variables are language-specific variables: the percent of the population that speaks a specific language in the household, if English is not spoken “very well”. According to this link from the Census Bureau, only a handful of languages other than English are spoken by a significant number of people. I will only consider the top five non-English languages in this analysis, and combine the others into a new variable. The languages I will consider individually are: Spanish, Chinese, Tagalog, Vietnamese, and French.
Another round of cleaning has to occur with the graduation data obtained from the DOE. For the variables “MAM_RATE_1112”, “MAS_RATE_1112”, “MBL_RATE_1112”, “MHI_RATE_1112”, “MTR_RATE_1112”, “MWH_RATE_1112”, “CWD_RATE_1112”, “ECD_RATE_1112”, the graduation rates are reported in a series of non-uniform bins. They are factor variables with a large number (>50) of levels. For example, for the graduation rate of white students in a graduation cohort (MWH_RATE_1112), the structure of the variable is:
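Collapsing the remaining language columns into a single variable can be sketched as a row-wise sum. This is a toy example: the short column names are hypothetical, only two "major" languages are shown to keep it small, and treating a missing percentage as zero (na.rm = TRUE) is an assumption. The name pct_OtherMinorLang matches the combined variable that appears later in the correlation output.

```r
# Toy data; short hypothetical column names stand in for the ACS variables.
lang <- data.frame(
  pct_Spanish = c(5, 0, 2),
  pct_Chinese = c(1, 0, 0),
  pct_German  = c(0.5, 0.1, NA),
  pct_Korean  = c(0.2, 0, 0.3)
)

# Languages kept as individual variables; everything else is "minor".
major <- c("pct_Spanish", "pct_Chinese")
minor <- setdiff(names(lang), major)

# Sum the minor-language percentages row by row (NA treated as 0 here).
lang$pct_OtherMinorLang <- rowSums(lang[, minor, drop = FALSE], na.rm = TRUE)

# Drop the individual minor-language columns.
lang <- lang[, c(major, "pct_OtherMinorLang")]
lang
```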
## Factor w/ 80 levels "","15-19","20-29",..: 57 48 67 66 44 63 63 50 50 80 ...
I will need to convert these factor variables to numeric variables using a mapping in order to facilitate downstream analysis. Since the reporting of this graduation data is inconsistent, with some districts reporting a discrete percentage of graduates and others reporting a range, I will choose the largest numerical range that is consistent across all variables in this data. In this case, I will bin the data in the following percentages: 0-19%, 20-39%, 40-59%, 60-79%, 80-100%. I can then convert the factor variables to numeric variables, with each number (1-5) representing a quintile. For example, a reported cohort graduation rate of 73% would be represented as 4 - because it falls within the fourth quintile. Likewise, a reported graduation rate of 20-29% would be converted to the numeral 2 - because it falls within the second quintile. Note: Category levels GE50, GE90, LE20, LE5, LT50, and PS will be excluded from the analysis and converted to NA in this mapping. These designations are DOE designations used to protect the identity of students. For example, the PS level is used if there are five or fewer students in a graduation cohort meeting the criteria of the variable. Thus, I deem this data incomplete and will exclude it from the analysis.
It is worth noting that a considerable loss of information occurs when the graduation rate for special populations is binned into quintiles. This, however, cannot be helped due to inconsistent reporting of graduation rates in these populations.
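The mapping from reported levels to the five bins can be sketched as a small helper. This is an illustration under stated assumptions, not the exact mapping used here: rate_to_bin is a hypothetical name, a reported range is assigned by its midpoint (which assumes no reported range straddles two bins), and any level that is neither a single number nor a range (GE50, LE5, PS, the empty level, and so on) becomes NA.

```r
# Hypothetical helper: map reported graduation-rate levels to bins 1-5.
# Bins: 0-19 -> 1, 20-39 -> 2, 40-59 -> 3, 60-79 -> 4, 80-100 -> 5.
rate_to_bin <- function(x) {
  x <- as.character(x)
  value <- rep(NA_real_, length(x))

  # Levels reported as a single whole percentage, e.g. "73".
  is_single <- grepl("^[0-9]+$", x)
  value[is_single] <- as.numeric(x[is_single])

  # Levels reported as a range, e.g. "20-29"; use the midpoint.
  is_range <- grepl("^[0-9]+-[0-9]+$", x)
  bounds <- strsplit(x[is_range], "-", fixed = TRUE)
  value[is_range] <- vapply(bounds, function(b) mean(as.numeric(b)), numeric(1))

  # Suppressed levels (GE50, PS, "", ...) were never filled in and stay NA.
  # pmin() keeps a reported 100% in the top bin.
  as.integer(pmin(floor(value / 20) + 1, 5))
}

rate_to_bin(factor(c("73", "20-29", "PS", "GE50", "")))
# 4 2 NA NA NA
```

Applied to a column such as MWH_RATE_1112, this would yield the numeric variable that later appears as MWH_quint.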
First I want to look at the graduation rates of all students:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 80.00 87.00 83.04 92.00 99.00 122
This first plot is the reported graduation rate for all school districts for all students. This is not a normal distribution. It is strongly negatively skewed. This is evident both from the plots and from the summary statistics. The quartiles are very close together. I tried various transformations of the x and y axes to see if I could spot trends in the data. The transformations didn’t provide me with much more information. However, I do notice distinct spikes in the data. I wonder what is causing them? A large number of schools report 50%, 80%, 82%, 84%, 87%, 90%, 92%, and 95% graduation rates. There are also graduation rates well below 50%. These are concerning. I wonder what could cause such low graduation rates. When I look at total graduation rates from 0-50%, there are far fewer of them, but again, they seem to be reported in discrete intervals. Perhaps when the different states report graduation rates they round to some common percentage? The variation in state reporting methods may very well account for these spikes. This is something to look at during the multivariate analysis. I chose to view this data as a histogram instead of a box plot because of these spikes: in a box plot they would not have been evident. Histograms are better at visualizing large variances in the observed frequencies in the dataset.
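One quick way to confirm which values produce the spikes is to tabulate the reported rates and sort by frequency. The vector below is made-up toy data, not the real ALL_RATE_1112 values.

```r
# Toy graduation rates; the real column would be ALL_RATE_1112.
rates <- c(80, 80, 80, 87, 87, 90, 50, 50, 73, 91)

# Count how often each distinct rate is reported, most frequent first.
freq <- sort(table(rates), decreasing = TRUE)

# The most common reported values are the candidates for histogram spikes.
head(freq, 3)
```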
Next I want to look at how the graduation rate differs with ethnicity:
There are wide variances both in the number of observations of graduation rates of students of different ethnicities and in the actual graduation rates. I expect that many schools did not report graduation rates for Native Americans (MAM) since many schools may have a low percentage of Native Americans in the graduating cohort. Conversely, there are a large number of schools reporting graduation rates for whites (MWH) - they are the largest racial group in the United States. A boxplot may better visualize differences in graduation rates between the cohorts.
## V1 V2 V3 V4
## Min. :2.000 Min. :2.0 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:4.0 1st Qu.:4.000 1st Qu.:4.000
## Median :4.000 Median :5.0 Median :4.000 Median :4.000
## Mean :3.593 Mean :4.7 Mean :4.282 Mean :4.343
## 3rd Qu.:4.000 3rd Qu.:5.0 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.0 Max. :5.000 Max. :5.000
## NA's :9607 NA's :9407 NA's :7836 NA's :7850
## V5 V6
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:5.000
## Median :4.000 Median :5.000
## Mean :4.358 Mean :4.819
## 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000
## NA's :9575 NA's :4063
The box plots show that Native Americans (V1 in the summary) have the lowest mean graduation rate, and the largest difference in minimum and maximum graduation rates, with no suspected outliers. The white cohort (V6 in the summary) has the highest mean graduation rate, but with significant suspected outliers. The mean and quartiles of the other four variables are between these two extremes. What are some demographic reasons for the difference in graduation rates between races? Language spoken in the home may influence graduation rates in the cohort identifying as Hispanic. Otherwise I hypothesize that demographic variables such as poverty level, employment status, and education level of the population influence graduation rates. These are areas to look at during the bi/multi-variate analysis.
The other two statistics I want to consider from the graduation data are percentage of children with disabilities graduating within 4 years, and the percentage of children from economically disadvantaged households graduating within four years.
## V1 V2
## Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:4.000
## Median :4.000 Median :4.000
## Mean :3.728 Mean :4.363
## 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000
## NA's :5971 NA's :4409
Note: the graduation rate for all students was not binned into quintiles. This data is used for a visual comparison only. The mean graduation rate of students with disabilities (CWD) (V1 in summary) is lower than the graduation rate of any special population except Native Americans and Blacks. In addition, students from economically disadvantaged households (ECD) graduate at a lower rate than all students in total. Interestingly, the distribution of frequencies in these two data sets is flipped: the CWD distribution is negatively skewed, while the ECD distribution is positively skewed. There is also a large variance in the CWD graduation rates. This indicates that some school districts have high graduation rates of ECD and CWD students. It will be interesting to see what the demographics of those school districts are.
I’d like to look at the distributions of these variables as well to see if there are significant differences not shown by box plots.
In summary, the univariate analysis of the graduation data shows that the graduation rates of Native Americans, students with disabilities, Blacks, and economically disadvantaged students show the most potential for improvement. It is worth noting that these populations are not mutually exclusive, and their relationships will be explored in the next section. Before continuing, however, I would like to look at the demographics of the US population as a whole to provide a baseline for comparison of special populations during the bi/multivariate analysis.
I am interested in exploring the population by age, race, language use, education level (of those over age 25), poverty level, and unemployment. These variables from the ACS data have been merged with the graduation data so the data is present for each school district in the country.
First I’ll look at how the US population varies with age.
## V1 V2 V3 V4
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 4.613 1st Qu.:15.09 1st Qu.: 6.036 1st Qu.:20.67
## Median : 5.863 Median :17.52 Median : 7.637 Median :23.23
## Mean : 6.081 Mean :17.45 Mean : 8.776 Mean :23.64
## 3rd Qu.: 7.353 3rd Qu.:19.88 3rd Qu.: 9.559 3rd Qu.:26.14
## Max. :23.255 Max. :57.14 Max. :100.000 Max. :68.16
## NA's :56 NA's :56 NA's :56 NA's :56
## V5 V6
## Min. : 0.00 Min. : 0.00
## 1st Qu.:25.32 1st Qu.:12.24
## Median :28.55 Median :15.44
## Mean :28.26 Mean :15.79
## 3rd Qu.:31.54 3rd Qu.:18.95
## Max. :75.32 Max. :81.08
## NA's :56 NA's :56
This plot shows the age distribution of the population of all census tracts associated with a school district. All 6 population bins are roughly normally distributed (see plot and summary statistics), and the baby boomer population is easily recognized as the largest population in the US (blue on plot, V5) with a mean of 28.26% of the population in census tracts. It will be interesting to see how population distribution correlates with high school graduation rates. Note: while this (and following) plots are not true univariate plots, I’ve chosen to combine the plots and include them in the univariate section because I am only looking at one demographic variable in these plots. The census data just bins these variables.
Now I want to look at how the US population is distributed by race:
## V1 V2 V3 V4
## Min. : 0.000 Min. : 0.00 Min. : 0.0000 Min. : 0.00000
## 1st Qu.: 1.103 1st Qu.: 72.93 1st Qu.: 0.1106 1st Qu.: 0.00000
## Median : 2.806 Median : 90.25 Median : 0.7939 Median : 0.06648
## Mean : 9.094 Mean : 79.99 Mean : 6.1687 Mean : 1.46079
## 3rd Qu.: 8.492 3rd Qu.: 95.76 3rd Qu.: 4.0859 3rd Qu.: 0.52649
## Max. :100.000 Max. :100.00 Max. :100.0000 Max. :100.00000
## NA's :56 NA's :56 NA's :56 NA's :56
## V5 V6 V7
## Min. : 0.000 Min. : 0.00000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.:0.0000
## Median : 0.325 Median : 0.00000 Median :0.0000
## Mean : 1.630 Mean : 0.05658 Mean :0.1004
## 3rd Qu.: 1.295 3rd Qu.: 0.00000 3rd Qu.:0.0000
## Max. :85.817 Max. :11.49298 Max. :9.0043
## NA's :56 NA's :56 NA's :56
A boxplot with the y-values transformed on the log10 scale was the most informative method of plotting this data. Note: the data is plotted in grey behind the summary boxplots for each variable. In the majority of census tracts, Whites make up most of the population (mean = 79.99% in each tract, V2), while those identifying as Native Hawaiian or Pacific Islander make up on average only 0.06% of each census tract (V6). Interestingly, there are some census tracts that trend heavily non-white: Hispanic, Black, Native American (AIAN), or Asian, as indicated by the summary statistics and seen on the box plot. It will be interesting to look at the graduation rates of these outlier school districts. Common perception is that school districts with a high percentage of non-white residents tend to be low performing. I wonder if the data will actually support this hypothesis. I would suspect that there are other factors instead of, or in addition to, race that also affect graduation rates.
Another factor that may influence graduation rates is language spoken in the home. What is the distribution of languages in the United States?
## V1 V2 V3 V4
## Min. : 0.00 Min. : 0.00000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 88.02 1st Qu.: 0.07157 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 94.97 Median : 0.52811 Median : 0.0000 Median : 0.0000
## Mean : 89.17 Mean : 2.78207 Mean : 0.1558 Mean : 0.0651
## 3rd Qu.: 97.51 3rd Qu.: 2.16709 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :100.00 Max. :61.12830 Max. :32.8576 Max. :55.5556
## NA's :56 NA's :56 NA's :56 NA's :56
## V5 V6
## Min. : 0.00000 Min. : 0.00000
## 1st Qu.: 0.00000 1st Qu.: 0.00000
## Median : 0.00000 Median : 0.00000
## Mean : 0.07423 Mean : 0.06239
## 3rd Qu.: 0.00000 3rd Qu.: 0.00000
## Max. :15.12116 Max. :15.00873
## NA's :56 NA's :56
The boxplot and frequency polygon show that English is the primary language for the majority (mean = 89%) of Americans in each census tract. However, we can see that in some census tracts up to 71% of people over the age of five speak a language other than the six major languages in the US (pct_OtherLang, V7). Clearly, there is great diversity of languages spoken in each census tract, and it may correlate with the graduation rates. Again, outlier analysis may be interesting.
Since I am looking at graduation rates, what is the education level of the US population?
## V1 V2
## Min. : 0.000 Min. : 0.00
## 1st Qu.: 7.723 1st Qu.: 12.83
## Median :11.813 Median : 17.84
## Mean :13.885 Mean : 22.03
## 3rd Qu.:18.051 3rd Qu.: 26.43
## Max. :79.624 Max. :100.00
## NA's :58 NA's :58
Looking at frequency histograms and boxplots of the entire dataset, it appears that the percent of adults who are not high school graduates and the percent who are college graduates are similarly distributed. The mean percentage per census tract of non-high school grads is 14%, and the mean percentage of college grads is 22%. However, there are significant outliers in both variables. Both histograms are positively skewed. I wonder what the correlation is between the two populations. Are there school districts/census tracts where there is both a large percentage of non-high school grads and college grads? What is the graduation rate for those districts? What is the correlation between adult education level and high school graduation rates?
In addition to race, language, and education level, I suspect poverty may play a role in graduation rates. What is the distribution of poverty levels in the United States?
The histogram of poverty rates is positively skewed. The majority of school districts have poverty levels below 25 percent. However, there are some school districts with poverty rates higher than 50%. I imagine that the household poverty level correlates with the percent of economically disadvantaged students in the graduating cohort, but how does that affect total graduation rates?
Another demographic variable that is often correlated with poverty rate is unemployment rate. What is the age distribution of unemployment rates in the US?
## V1 V2 V3 V4
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 5.031 1st Qu.: 8.594 1st Qu.: 3.663 1st Qu.: 2.915
## Median : 7.483 Median : 15.545 Median : 6.692 Median : 5.190
## Mean : 8.501 Mean : 17.977 Mean : 8.017 Mean : 6.184
## 3rd Qu.:10.876 3rd Qu.: 24.615 3rd Qu.: 10.836 3rd Qu.: 8.318
## Max. :65.714 Max. :100.000 Max. :100.000 Max. :89.474
## NA's :64 NA's :77 NA's :73 NA's :75
## V5
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 5.171
## 3rd Qu.: 7.543
## Max. :100.000
## NA's :185
The unemployment rates in total and for each age bin appear to be very similar when assessed via histogram. However, when we look at the summary statistics, we see that the mean unemployment rate varies between 5.2% for ages 65+ (V5) and 18% for ages 16-24 (V2). It is likely that the unemployment rate is higher for this latter population because a percentage of the population would still be in school (either high school or college). How does the high school graduation rate correlate with the unemployment rate for each age group? Is a high unemployment rate for people ages 16-24 indicative of higher, lower, or neutral graduation rates?
In summary, a univariate analysis of the census data provides a snapshot of the US population as a whole, but no conclusions can be made about how demographic data correlates with high school graduation rates. A more detailed analysis will be needed.
One set of variables that I did not include in the above univariate analysis was that related to location (State, County). Now, however, it will be interesting to see how graduation rates vary by location.
The graduation rate varies widely by state. However, the choropleth makes it clear that there are several regions where the graduation rate is markedly lower than elsewhere in the country, the Southeast and the Great Plains being the most prominent. The Northeast and West Coast appear to have higher graduation rates. No graduation data is available for Idaho, Oklahoma, and Kentucky. Are the demographics of the higher-performing regions significantly different from those of the low-performing regions? Or is there better funding for education, or different educational policies, that influence graduation rates? This investigation will seek to answer the first question.
Looking at graduation rates by state is a rather coarse-grained approach to investigating the geographic relationship between graduation rates. I can also look at graduation rates by county.
In general, the trends that were observed at a statewide level are also observed at the county level. The Southeast and Great Plains appear to have lower graduation rates than the Northeast, in particular. However, there are pockets of areas with high graduation rates in these areas as well. Of note, only one county in Oregon and one county in Hawaii reported graduation rates in this dataset; this was not apparent when looking at the data at the statewide level. Also, the binning in the first map (creating 9 equally sized bins) obscures counties with very low graduation rates, since the first bin contains graduation rates from 20-68.9%. It is easier to pick out very low graduation rates on a gradient map.
The gradient map is not as useful in looking at the graduation rates in the majority of the counties, since the data is so highly skewed, but it does highlight the outliers well: there are some counties in Colorado, South Dakota, Nebraska, and Kansas where the graduation rate is very low. The two counties with the lowest graduation rates are Sedgwick County, CO (08115) and Kiowa County, KS (20097).
I looked up the statistics for these counties. Sedgwick County has a total population of only ~2000 people, so it’s possible that students from these census tracts go to surrounding high schools. Kiowa County was hit by an EF5 tornado in 2007, the beginning year of our graduating cohort. Again, this may negatively influence graduation rates, since many students would presumably have left the county.
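The effect of equal-count binning on a skewed variable can be reproduced with quantile() and cut(). The sketch below uses made-up county rates (three very low values and 97 values spread between 70 and 99) and is only meant to illustrate why the lowest bin ends up so wide; it is not the mapping code used for the choropleth.

```r
# Toy, negatively skewed county rates: a few very low values, most high.
rates <- c(20, 32, 55, seq(70, 99, length.out = 97))

# Nine bins holding roughly equal numbers of counties (quantile breaks).
breaks <- quantile(rates, probs = seq(0, 1, length.out = 10))
bins <- cut(rates, breaks = unique(breaks), include.lowest = TRUE)

# The first bin has to stretch from the minimum up past 70 to collect
# its share of counties, so 20, 32 and 55 all get the same map color.
levels(bins)[1]
table(bins)
```

A continuous color gradient avoids this because each county's color depends on its own value rather than on its rank.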
## Source: local data frame [2 x 3]
##
## region mean value
## (dbl) (dbl) (dbl)
## 1 8115 20 20
## 2 20097 32 32
Clearly there is wide geographic variation in graduation rates, which may or may not be related to population demographics. I could explore this variable in greater depth, but instead I would like to go ahead and start looking at the rest of the demographic data.
Since even the pared-down dataset has more than 80 variables, I want to first use a correlation matrix to focus my further analysis. The correlation matrix will allow me to pick out not-obvious trends in how the variables are related and may lead to some interesting insights.
## row column cor
## 905 State FIPS 0.9999803
## 182 pct_Hispanic_ACS_08_12 pct_Age5p_Spanish_ACS_08_12 0.8864271
## 164 pct_Hispanic_ACS_08_12 pct_Othr_Lang_ACS_08_12 0.8556968
## 190 pct_Othr_Lang_ACS_08_12 pct_Age5p_Spanish_ACS_08_12 0.8213596
## 495 pct_Civ_unemp_16p_ACS_08_12 pct_Civ_unemp_25_44_ACS_08_12 0.7954510
## 781 ALL_RATE_1112 MWH_quint 0.7371281
## 309 pct_Pop_45_64_ACS_08_12 pct_Pop_25yrs_Over_ACS_08_12 0.7212008
## 148 pct_NH_White_alone_ACS_08_12 pct_Age5p_Only_Eng_ACS_08_12 0.7100343
## 862 ALL_RATE_1112 ECD_quint 0.7059010
## 704 ALL_RATE_1112 MHI_quint 0.7017753
## 526 pct_Civ_unemp_16p_ACS_08_12 pct_Civ_unemp_45_64_ACS_08_12 0.6943071
## 667 ALL_RATE_1112 MBL_quint 0.6923416
## 821 ALL_RATE_1112 CWD_quint 0.6732302
## 900 MHI_quint ECD_quint 0.6714531
## 899 MBL_quint ECD_quint 0.6665194
## 465 pct_Civ_unemp_16p_ACS_08_12 pct_Civ_unemp_16_24_ACS_08_12 0.6419508
## 310 pct_Pop_65plus_ACS_08_12 pct_Pop_25yrs_Over_ACS_08_12 0.6355349
## 345 pct_Age5p_Spanish_ACS_08_12 pct_Not_HS_Grad_ACS_08_12 0.6255256
## 405 pct_Not_HS_Grad_ACS_08_12 pct_Prs_Blw_Pov_Lev_ACS_08_12 0.6058463
## 596 ALL_RATE_1112 MAM_quint 0.5961023
## 336 pct_Hispanic_ACS_08_12 pct_Not_HS_Grad_ACS_08_12 0.5726486
## 741 MBL_quint MHI_quint 0.5681063
## 742 ALL_RATE_1112 MTR_quint 0.5629030
## 859 MHI_quint CWD_quint 0.5521337
## 897 MAM_quint ECD_quint 0.5515264
## row column cor
## 36 pct_Pop_25_44_ACS_08_12 pct_Pop_45_64_ACS_08_12 -0.3672954
## 306 pct_Pop_5_17_ACS_08_12 pct_Pop_25yrs_Over_ACS_08_12 -0.3695661
## 151 pct_NH_Asian_alone_ACS_08_12 pct_Age5p_Only_Eng_ACS_08_12 -0.3729777
## 42 pct_Pop_5_17_ACS_08_12 pct_Pop_65plus_ACS_08_12 -0.3802327
## 305 pct_Pop_Under_5_ACS_08_12 pct_Pop_25yrs_Over_ACS_08_12 -0.3825920
## 579 pct_Age5p_Only_Eng_ACS_08_12 pct_OtherMinorLang -0.4055922
## 418 pct_NH_White_alone_ACS_08_12 pct_Civ_unemp_16p_ACS_08_12 -0.4095596
## 406 pct_College_ACS_08_12 pct_Prs_Blw_Pov_Lev_ACS_08_12 -0.4182655
## 33 pct_Pop_Under_5_ACS_08_12 pct_Pop_45_64_ACS_08_12 -0.4222113
## 387 pct_Pop_45_64_ACS_08_12 pct_Prs_Blw_Pov_Lev_ACS_08_12 -0.4291700
## 404 pct_Pop_25yrs_Over_ACS_08_12 pct_Prs_Blw_Pov_Lev_ACS_08_12 -0.4306279
## 390 pct_NH_White_alone_ACS_08_12 pct_Prs_Blw_Pov_Lev_ACS_08_12 -0.4780723
## 343 pct_Age5p_Only_Eng_ACS_08_12 pct_Not_HS_Grad_ACS_08_12 -0.5095332
## 35 pct_Pop_18_24_ACS_08_12 pct_Pop_45_64_ACS_08_12 -0.5370445
## 44 pct_Pop_25_44_ACS_08_12 pct_Pop_65plus_ACS_08_12 -0.5370823
## 337 pct_NH_White_alone_ACS_08_12 pct_Not_HS_Grad_ACS_08_12 -0.5726485
## 78 pct_NH_White_alone_ACS_08_12 pct_NH_Blk_alone_ACS_08_12 -0.5840800
## 378 pct_Not_HS_Grad_ACS_08_12 pct_College_ACS_08_12 -0.5953885
## 183 pct_NH_White_alone_ACS_08_12 pct_Age5p_Spanish_ACS_08_12 -0.6230145
## 66 pct_Hispanic_ACS_08_12 pct_NH_White_alone_ACS_08_12 -0.6853153
## 165 pct_NH_White_alone_ACS_08_12 pct_Othr_Lang_ACS_08_12 -0.7100343
## 307 pct_Pop_18_24_ACS_08_12 pct_Pop_25yrs_Over_ACS_08_12 -0.7385868
## 189 pct_Age5p_Only_Eng_ACS_08_12 pct_Age5p_Spanish_ACS_08_12 -0.8213596
## 147 pct_Hispanic_ACS_08_12 pct_Age5p_Only_Eng_ACS_08_12 -0.8556968
## 171 pct_Age5p_Only_Eng_ACS_08_12 pct_Othr_Lang_ACS_08_12 -1.0000000
After removing all the minor languages, there are 43 variables to consider in the dataset. A correlation matrix will consist of 1849 correlations between variables. Printed above are the 25 most positive and 25 most negative correlations between variables, sorted first by Pearson correlation coefficient then by P-value. State and FIPS are near perfectly correlated since the two-digit State codes are the leading numerals in the FIPS codes. Not surprisingly, there are high positive correlations between the percentage of households that identify as Hispanic and the percent of the population that speaks Spanish.
Looking at the graduation data, the total graduation rate is strongly or moderately correlated with the graduation rate of special populations. In addition, the graduation rates of special populations are strongly or moderately correlated with each other. This is not surprising. It appears that schools with high graduation rates in general have high graduation rates of special populations, and vice versa.
The total graduation rate is only mildly negatively correlated with the percent of the population below the poverty level (-0.22). Interestingly, the graduation rates of students with disabilities (CWD) are mildly negatively correlated with poverty level, suggesting that students with disabilities graduate at a lower rate if the school serves a higher percentage of economically disadvantaged people. Likewise, CWD is mildly negatively correlated with the percentage of people identifying as Black (-0.26). The graduation rate for white students (MWH) is also mildly negatively correlated with the percent of the population that is Black (-0.23). Given that the percent of the Black population is moderately positively correlated with the poverty level (0.39), the negative correlation between the graduation rate of white students and the percent of the population identifying as Black could be related to poverty, even though these correlations are not obvious in a pair-wise consideration.
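A table of ranked correlations like the one above can be built by flattening the correlation matrix into one row per variable pair. This is a minimal sketch on simulated data with hypothetical columns a to d; it keeps each pair once (upper triangle only), handles missing values pairwise, and does not compute P-values, so it is simpler than the table printed above.

```r
set.seed(42)

# Simulated data: b tracks a, c runs opposite to a, d is unrelated noise.
n <- 200
x <- rnorm(n)
toy <- data.frame(
  a = x,
  b = x + rnorm(n, sd = 0.3),
  c = -x + rnorm(n, sd = 0.5),
  d = rnorm(n)
)
toy$a[c(3, 17)] <- NA  # a few missing values, as in the real data

# Pearson correlations, using all complete pairs for each variable pair.
cm <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once by taking the upper triangle of the matrix.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  row    = rownames(cm)[idx[, "row"]],
  column = colnames(cm)[idx[, "col"]],
  cor    = cm[idx]
)

# Sort from most positive to most negative.
pairs <- pairs[order(pairs$cor, decreasing = TRUE), ]
head(pairs, 3)
tail(pairs, 3)
```

With 43 variables, the same approach would give 43 * 42 / 2 = 903 distinct pairs out of the 1849 cells of the full matrix.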
It is hard to visualize correlations with a large table, so I’ll use a heatmap to help me visually parse the correlation data better.
Red denotes a negative correlation, while blue denotes a positive correlation. The color is graduated based on the strength of the correlation. The heatmap allows us to pick out additional trends in the data. The right-most column of the heatmap shows the correlations between the total graduation rate and all other variables. A few trends are evident. As mentioned above, the graduation rate is positively correlated with the graduation rate of special populations. It is also mildly positively correlated with geographic variables (state, county), the total population, the percent of Whites, the percent of people who speak only English, and certain age demographic groups (population under 5, population between ages 5-17 and 25-44). It is negatively correlated with unemployment rate, predominantly speaking a language other than English, percent of non-white populations, poverty level, percent of non-high school graduates over age 25, and certain age demographics (population between 18-24, 45-64, and 65 plus).
In the following bi-variate analysis I will only focus on how the total graduation rate varies with the graduation rates of special populations, poverty level, and education level of the population over age 25. I have already explored how graduation rate varies by geographic location. I will focus on the correlations between graduation rate and the rest of the demographic data in the multi-variate analysis.
## [1] 0.6732302
## [1] 0.705901
These plots look at the graduation rates of special populations vs. the total graduation rate. The size of a point on the plot is mapped to the number of observations at that point. The blue line is a linear fit of the data, with the shaded region indicating 95% confidence intervals for the fit. This visualization is useful in picking out outliers. There is a high positive correlation between the total graduation rate of students and the graduation rates of both students with disabilities (0.67) and economically disadvantaged students (0.71). Schools that do a good job graduating these special populations graduate a high number of students overall, and vice versa. When I look at the correlation heatmap above, these are the strongest correlations between the total graduation rate and all other variables. It suggests that in order to increase overall graduation rates, schools should put policies in place that increase the graduation rates of these two special populations.
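The correlation coefficient and the linear fit behind plots like these can be computed directly. The sketch below uses a handful of made-up rows; only the column names ALL_RATE_1112 and CWD_quint come from this dataset, and dropping incomplete rows (use = "complete.obs") is an assumption about how missing values were handled.

```r
# Toy rows; NA marks a district that did not report one of the rates.
toy <- data.frame(
  ALL_RATE_1112 = c(92, 85, 60, 78, 95, NA, 88),
  CWD_quint     = c(5, 4, 2, NA, 5, 3, 4)
)

# Pearson correlation over rows where both values are present.
r <- cor(toy$ALL_RATE_1112, toy$CWD_quint, use = "complete.obs")

# Linear fit of total rate on the binned CWD rate; lm() also drops
# incomplete rows by default.
fit <- lm(ALL_RATE_1112 ~ CWD_quint, data = toy)

# 95% confidence intervals for the intercept and slope.
ci <- confint(fit, level = 0.95)

r
coef(fit)
ci
```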
What does the correlation between special populations look like?
## [1] 0.5433596
Not surprisingly, there is a moderate positive correlation (0.54) between the graduation rates of students with disabilities and economically disadvantaged students. Schools appear to do either a good or poor job of graduating both these special populations.
Next, I want to see if racial group is correlated with overall graduation rate.
These plots look at the correlation between the graduation rate of students of a particular race vs. all students. As the heatmap and correlation coefficients showed, there is a positive correlation between the total graduation rate and the graduation rates of students of a particular race. However, there are some outliers, as these charts clearly show. Some schools that have an overall high graduation rate have a low graduation rate for minority students, particularly Native Americans. This is also true for Black, Hispanic, and Asian students. I wonder what factors could be contributing to this phenomenon. Perhaps the graduation rate of minority students is lower when they are a minority in the general population. I’ll look at how the graduation rate of minorities is correlated with the percent of the population that is White.
## MHI_White.cor = -0.106657062926838
## MAM_White.cor = -0.00336407794216555
## MAS_White.cor = -0.087066060157118
## MTR_White.cor = 0.0704732024778238
## MBL_White.cor = -0.00604672504230318
## MWH_White.cor = 0.218466535974701
This avenue of exploration appears to be a dead end. There are no significant correlations between the graduation rates of minority students and the percent of the population that is white. There is a small correlation between the graduation rates of white students and the percent of the population that is white (0.22), but this may be due to other demographic factors, such as poverty level.
Let me check the correlation between the percent of a population that is white and poverty level:
There is a moderate negative correlation (-0.48) between the percent of the population that is white and the percent of people below the poverty level. The observed correlation between the graduation rates of white students and the percent of the population that is white could be due to other demographics like the poverty rate instead of race.
I’ll look at how poverty rate is correlated to graduation rate:
## [1] -0.2634138
There is a mild negative correlation (-0.26) between the graduation rate and the percent of people below the poverty level. However, on this chart, the outliers are most interesting. There are 10 districts where both the graduation rate and the poverty rate are above 75%. It may be interesting to explore this subset of data in more detail later.
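To follow up on those outliers later, one option is to compute the correlation and then pull out the districts where both rates exceed 75%. This is only a sketch on simulated data; `grad_rate` and `poverty_rate` are assumed column names, not the ones in the dataset.

```r
# Simulated district data; column names are hypothetical.
set.seed(1)
n <- 1000
districts <- data.frame(
  grad_rate    = runif(n, 50, 100),
  poverty_rate = runif(n, 0, 100)
)

# Overall correlation between graduation rate and poverty rate.
pov_cor <- cor(districts$grad_rate, districts$poverty_rate,
               use = "complete.obs")

# Districts where both the graduation rate and the poverty rate exceed 75%.
high_both <- subset(districts, grad_rate > 75 & poverty_rate > 75)
print(nrow(high_both))
```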
But right now, I want to look at how other demographic variables influence graduation rate.
Another factor that may be important in graduation rates is the education level of the general population. I want to see whether it affects graduation rates and, if so, how.
## College_Grad.cor =  0.24479844300728
## Not_Grad.cor     = -0.197205831335827
## [1] -0.4182655
There is a mild positive correlation (0.24) between high school graduation rate and percent of college graduates in a population. There is a mild negative correlation (-0.20) between high school graduation rate and the percent of non-high school grads in a population. These results are what I expected, but may be due to poverty or other demographic factors that are correlated with education level in the census data. If I look back at the heatmap of correlations, I see that education level is negatively correlated with the percent of persons below the poverty level and with the unemployment rate. The correlation between poverty rate and college degree is -0.42. This observation again suggests that high school graduation rates are affected by multiple, complex demographic variables.
Now I want to look at how graduation rates are affected by the age distribution of the population, race, and language.
This plot is a scatterplot of graduation rate vs. percent of population binned by age. There appears to be no strong correlation between graduation rates and the age distribution of the population. I will not consider these variables in later analysis.
I wonder if there is any correlation between race and graduation rate. I saw earlier that there was a mild positive correlation between graduation rates and the percent of whites in a population. I wonder if there is any correlation between the total graduation rate and the percent of the population that belongs to a minority group.
## [1] -0.05220146
## [1] 0.2140248
## [1] -0.1923399
## [1] -0.2180079
## [1] 0.1049234
## [1] -0.02561316
## [1] 0.004701983
This plot looks at graduation rate vs. percent of the population of a particular race. There are mild negative correlations between graduation rates and the percent of the population that is Black (-0.19) or Native American (-0.22). The other correlations are not significant. It’s interesting that there is not a correlation between the graduation rate and the percent of the population that is Hispanic. The strongest correlation observed in the entire dataset is between the percent of people in a census tract that identify as Hispanic and the percent of people who speak Spanish. There is almost no correlation between graduation rates and those that identify as Hispanic (-0.05), but I wonder if the percent of people that speak a language other than English is at all correlated with graduation rates.
## [1] 0.03870797
## [1] -0.03870797
## [1] -0.05641179
## [1] 0.06767744
## [1] -0.01143481
## [1] 0.001626427
## [1] -0.0009245437
## [1] 0.0008475358
There are no strong correlations between graduation rates and language. I will not pursue this line of questioning further.
I’ve seen in both the bivariate analysis and multivariate analysis so far that graduation rates cannot be tied to a single demographic variable. More likely, a combination of demographic variables influences graduation rates in communities.
I’d now like to look at some of the variables that showed mild correlations to graduation rates, and see if they strengthen one another.
Since adult education level and unemployment rates are both correlated with poverty rates, let’s see how these factors together influence graduation rates.
There appears to be a trend in this plot. Let me take a closer look at the lower left-hand quadrant and upper right-hand quadrant:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 4.892 8.087 9.333 12.150 92.780 5
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.372 14.480 20.300 22.350 28.220 86.670 4
These two plots are the lower left and upper right quadrants of the plot above. Lower graduation rates appear to be correlated with a higher poverty rate and a higher percent of adults without a high school degree. These plots are split along the mean graduation rate (83%) and the mean % of adults without a high school degree (14%). The mean poverty rate for census tracts with a higher than average graduation rate and lower than average % of adults without a high school degree is 9%. The poverty rate for the opposite case is 22%.
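A split like this can be sketched by cutting the data at the two means and summarizing the poverty rate in each piece. The data frame below is simulated and its column names (`grad_rate`, `pct_no_hs`, `poverty_rate`) are assumptions made for illustration.

```r
# Simulated census-tract data; column names are hypothetical.
set.seed(7)
n <- 2000
tracts <- data.frame(
  grad_rate    = runif(n, 40, 100),
  pct_no_hs    = runif(n, 0, 40),
  poverty_rate = runif(n, 0, 60)
)

# Split points: the mean of each variable.
grad_mean  <- mean(tracts$grad_rate, na.rm = TRUE)
no_hs_mean <- mean(tracts$pct_no_hs, na.rm = TRUE)

# Higher-than-average graduation rate, lower-than-average share of
# adults without a high school degree.
high_grad <- subset(tracts, grad_rate > grad_mean & pct_no_hs < no_hs_mean)

# The opposite case.
low_grad <- subset(tracts, grad_rate < grad_mean & pct_no_hs > no_hs_mean)

print(summary(high_grad$poverty_rate))
print(summary(low_grad$poverty_rate))
```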
I saw earlier that there was a correlation between the graduation rate of economically disadvantaged students and the total graduation rate. I wonder if there is a correlation between poverty rates, the graduation rates of economically disadvantaged students, and the total graduation rate.
There does not appear to be a strong correlation between poverty rate, the graduation rate of ECD students, and the total graduation rate.
The graduation rates of Native American and Black students are lower than those of white students. On the heatmap above we also saw that the percent of people below the poverty level was correlated with the percent of people identifying as Black, Native American, and Hispanic. I’d like to explore whether poverty is a factor in the low graduation rates of these minorities.
In this plot we see that a higher percentage of Native Americans in the population is correlated with a higher poverty level, but not necessarily a lower graduation rate for Native American students. There appears to be an equal distribution of high poverty and low poverty census tracts across all quintiles of NAm grad rates. A factor other than poverty may explain lower graduation rates for these students.
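For reference, a graduation rate can be binned into five groups with `cut()`. The sketch below assumes equal-width 20-point bins on a 0 to 100 scale, which may not match how the quintile variables in this analysis were actually built; the vector name `grad_rate` is also made up.

```r
# Simulated graduation rates with one missing value; the name is hypothetical.
set.seed(11)
grad_rate <- c(runif(200, 0, 100), NA)

# Assumed binning: five equal-width bins (0-20, 20-40, ..., 80-100),
# returned as integer codes 1 through 5. NA values stay NA.
bin_breaks <- seq(0, 100, by = 20)
grad_bin <- cut(grad_rate, breaks = bin_breaks,
                labels = FALSE, include.lowest = TRUE)

print(table(grad_bin, useNA = "ifany"))
```

If true quintiles (equal-count bins) are wanted instead, the `breaks` argument could be replaced with the output of `quantile()` at probabilities 0, 0.2, ..., 1.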
I wonder if there is a relationship between Black student graduation rates and poverty.
There is more data on these plots, so it is harder to see trends in the data. I will need to use a different plot to look at this data.
##
## Welch Two Sample t-test
##
## data: GradBlkPov.sub1$MBL_quint and GradBlkPov.sub2$MBL_quint
## t = 3.5644, df = 462.72, p-value = 0.0004026
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0670469 0.2318136
## sample estimates:
## mean of x mean of y
## 4.288803 4.139373
In this plot, I am looking at the graduation and poverty rates of two populations: communities where the percent of Blacks is less than 25%, and those where the percent of Blacks is greater than 75%. It’s interesting that the school districts that have the lowest graduation rates of Black students in a cohort are in communities that have a low population of Blacks and have a low poverty rate. However, the mean graduation-rate quintiles of Black students are similar: 4.288803 for communities with a low Black population, and 4.139373 for communities with a high Black population. I performed a Welch’s t-test on these two samples to see if this difference in means is significant. The significance threshold was set at 0.05. The calculated p-value was 0.0004026, so this difference is statistically significant. Black students graduate at lower rates when the surrounding community is majority Black, compared to when the surrounding community is not majority Black. This effect could be explained by poverty rates: in majority Black communities, the poverty rate is greater, on average, than in minority Black communities. Let me next consider the two extremes: the graduation rates of Black students in communities where there is low poverty and high percent Black, and high poverty, low percent Black:
##
## Welch Two Sample t-test
##
## data: GradBlkMBL.sub1$MBL_quint and GradBlkMBL.sub2$MBL_quint
## t = 1.796, df = 49.088, p-value = 0.07865
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02467678 0.43982922
## sample estimates:
## mean of x mean of y
## 4.411765 4.204188
OK, this is interesting. There is no statistically significant difference between these two samples (p-value = 0.07865, above the 0.05 threshold). The graduation rates of Black students in high poverty, majority Black communities are statistically the same as the graduation rates of Black students in low poverty, majority Black communities. So, to recap: Black students graduate at lower rates when the surrounding community is majority Black, but this observation cannot be explained by poverty alone.
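Both comparisons above use Welch’s t-test, which is what `t.test()` runs by default for two samples (`var.equal = FALSE`). Here is a minimal sketch on simulated quintile values; the vectors `low_share` and `high_share` and the sampling probabilities are invented for illustration.

```r
# Simulated graduation-rate quintiles (1-5) for two groups of districts.
# Group names and sampling probabilities are made up.
set.seed(123)
low_share  <- sample(1:5, 400, replace = TRUE,
                     prob = c(0.02, 0.05, 0.13, 0.30, 0.50))
high_share <- sample(1:5, 300, replace = TRUE,
                     prob = c(0.03, 0.07, 0.15, 0.30, 0.45))

alpha <- 0.05  # significance threshold

# With two samples and the default var.equal = FALSE,
# t.test() performs Welch's unequal-variance t-test.
result <- t.test(low_share, high_share)
print(result)

# TRUE when the difference in means is significant at the chosen threshold.
significant <- result$p.value < alpha
print(significant)
```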
I wonder if I see this same result for Hispanic students (the other large minority in the US)?
Again, I’m going to subset the data in this plot like I did before to look at the data from minority Hispanic communities and majority Hispanic communities.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 7.084 11.490 13.480 17.330 92.780 10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 16.21 23.16 24.37 30.24 60.77
##
## Welch Two Sample t-test
##
## data: GradHisPov.sub1$MHI_quint and GradHisPov.sub2$MHI_quint
## t = -9.7608, df = 626.68, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4375686 -0.2909913
## sample estimates:
## mean of x mean of y
## 4.23572 4.60000
Again, I see that the overall poverty rate for majority Hispanic communities is higher than in those communities where Hispanics are a minority (24.37 vs. 13.48). This is the same observation I made above with respect to majority Black communities. There is also a greater difference in the average graduation-rate quintiles of Hispanic students in minority vs. majority Hispanic communities (minority = 4.23572, majority = 4.60000). This difference is statistically significant, with a p-value below 2.2e-16. What’s really intriguing is that Hispanic students graduate at higher rates in communities with a majority Hispanic population, compared to communities with a minority Hispanic population. This is the opposite of the effect I observed with Black students.
Could this be a poverty rate effect? The difference in poverty rates between majority Hispanic and minority Hispanic communities is not as pronounced as it is between majority Black and minority Black communities. I’ll subset the data again, this time looking at extremes in the poverty rate instead of extremes in % Hispanic.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.000 4.000 5.000 4.694 5.000 5.000 17
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 3.00 4.00 3.88 4.00 5.00 366
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 4.00 4.00 4.33 5.00 5.00 4840
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.000 4.000 5.000 4.571 5.000 5.000 16
##
## Welch Two Sample t-test
##
## data: GradHisMHI.sub1$MHI_quint and GradHisMHI.sub2$MHI_quint
## t = 8.4261, df = 158.37, p-value = 2.053e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.6228541 1.0042427
## sample estimates:
## mean of x mean of y
## 4.693548 3.880000
I’ve seen that Hispanic students graduate at a higher rate when the surrounding population is Hispanic. Looking at this set of four plots, I can see that the demographic groups with the highest mean graduation rates are low poverty, high Hispanic (4.694), and high poverty, high Hispanic (4.571). This observation supports the conclusion above that Hispanic students graduate at higher rates in majority Hispanic communities, and this appears to be independent of poverty rates. This is in contrast to Black students that appear to graduate at higher rates in minority Black communities, once again, independent of poverty rates.
In summary, I’ve seen that poverty rates are negatively correlated with overall graduation rates. However, when I look at the graduation rates of economically disadvantaged students, Native Americans, Blacks, and Hispanics, the poverty rate is not a strong contributor to graduation rates for these populations. There appear to be many variables affecting graduation rates. It may be useful in the future to use factor analysis or PCA to identify other variables of interest that may be contributing to graduation rates.
The final analysis I want to perform is one that looks at the correlation between graduation rates, state, poverty, and racial groups. What is still puzzling to me is why some states have significantly lower graduation rates than other states. Are their demographics that different?
Confirming the observations made above, when the data is grouped by state, there is a positive correlation between high school graduation rate and the percent of the population that is white, and a negative correlation between graduation rate and the percent of the population that is Black or Native American. There is no correlation between graduation rate and the percent of the population that is Hispanic. There may also be a mild correlation between the percent of the population that is Black, the poverty rate, and the total graduation rate when the data is grouped by state. A possible regional variation exists as well. I want to look at this plot in more detail.
When I look at a larger version of this plot, I can see that states in the deep South (NC, AL, SC, MS, LA, GA, FL) have a lower than average high school graduation rate, a higher percent of the population that is Black, and a higher poverty rate than other states outside of the deep South that have higher high school graduation rates. It suggests that the deep South may be an area to focus resources on if we want to make the biggest change in high school graduation rates nationally. Policies that increase integration of Black students into more affluent schools and decrease poverty rates overall may be beneficial in raising graduation rates in the South.
These plots look at other variables besides race that I’ve determined influence high school graduation rates: poverty rate, % of adults without a high school degree, and % of adults with a college degree. When I look at this state-level data, it strikes me that there is no correlation between high school graduation rates and not having a high school degree, even though I saw a correlation when I looked at the individual census tract data. There must be other contributing variables that get averaged out when I consider only state level data.
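One way to see how averaging can change a correlation is to compute it at the tract level and again after averaging by state. The sketch below uses simulated tracts with a built-in negative relationship; the column names (`state`, `pct_no_hs`, `grad_rate`) and the numbers are assumptions, not values from the dataset.

```r
# Simulated tract-level data; names and coefficients are hypothetical.
set.seed(99)
states <- c("AL", "GA", "NY", "TX", "WA")
tracts <- data.frame(
  state     = sample(states, 1000, replace = TRUE),
  pct_no_hs = runif(1000, 0, 40)
)
tracts$grad_rate <- 90 - 0.3 * tracts$pct_no_hs + rnorm(1000, sd = 8)

# Correlation across individual tracts.
tract_cor <- cor(tracts$grad_rate, tracts$pct_no_hs)

# Average both variables within each state, then correlate the state means.
by_state <- aggregate(cbind(grad_rate, pct_no_hs) ~ state,
                      data = tracts, FUN = mean)
state_cor <- cor(by_state$grad_rate, by_state$pct_no_hs)

print(c(tract = tract_cor, state = state_cor))
```

With only a handful of state means, the state-level coefficient is noisy and can differ a lot from the tract-level one, which is consistent with the pattern described above.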
In summary, the multivariate analysis has revealed some interesting trends. When looking at the entire population, poverty rates and the education level of adults in a community both affected high school graduation rates. There were some surprising trends: 1) On average, Black students in majority Black communities graduate at a lower rate, irrespective of the poverty rate of the community. 2) On average, Hispanic students in majority Hispanic communities graduate at a higher rate, irrespective of poverty rate. 3) There is a regional difference in the percent of Black students that graduate, with Black graduation rates being depressed in the Deep South, and correlated with poverty rate, at the state level.
In the United States, high school graduation rates vary by state and geographic region. Graduation rates are lower in the South and Great Plains than they are in the Northeast or Rust Belt. Within states there is also variability at the county level. For example, counties on the Gulf Coast in Texas have very high graduation rates, while counties on the New Mexico border have generally lower graduation rates. This local variation may be related to different population demographics within each county. Note: areas shaded black on the plots indicate that there is no data for these regions in the dataset.
There is a negative correlation between high school graduation rate and the percent of people below the poverty level. (Correlation coefficient of -0.26: the best fit linear regression line is in blue.) This correlation suggests that implementing policies that decrease the poverty level in an area could also increase the high school graduation rate.
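A best-fit line like the one on this plot can be obtained with `lm()`. The sketch below fits a simple linear regression on simulated data; the column names and the slope used to generate the data are made up, so the fitted coefficients say nothing about the real dataset.

```r
# Simulated district data with a built-in negative slope (hypothetical values).
set.seed(2020)
n <- 800
districts <- data.frame(poverty_rate = runif(n, 0, 60))
districts$grad_rate <- 88 - 0.25 * districts$poverty_rate + rnorm(n, sd = 10)

# Simple linear regression of graduation rate on poverty rate.
fit <- lm(grad_rate ~ poverty_rate, data = districts)
print(coef(fit))

# For a one-predictor regression, R-squared equals the squared
# Pearson correlation between the two variables.
r <- cor(districts$grad_rate, districts$poverty_rate)
print(c(r = r, r_squared = summary(fit)$r.squared))
```

A correlation of -0.26 corresponds to an R-squared of roughly 0.07, so a line of this kind would explain only a small share of the variance in graduation rates.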
This plot shows the relationship between high school graduation rate, the percent of the population that is Black, and the poverty rate averaged by state. Mean high school graduation rate is 80%, and indicated by a blue line on the plot. Mean percent of the population that is Black is 9.5%, and indicated by a black line on the plot. The mean poverty rate, averaged by state, is 15%, and colored light green in the plot legend.
States in the deep South (AL, SC, MS, LA, GA, FL) have an average high school graduation rate lower than the mean. They also have a higher than average percent of the population that is Black, and a higher poverty rate than other states. The deep South is an area where the biggest regional impact on graduation rates can be made. Policies that increase integration of Black students into more affluent schools and decrease poverty rates overall may be beneficial in raising graduation rates in the South, and hence the overall graduation rate in the United States.
The purpose of this EDA was to translate the Department of Education graduation data into actionable insights that would lead to higher graduation rates across the US. I think I partially succeeded in this goal. I first looked at the Department of Education data and determined that schools that were successful in educating minorities and special populations of students had higher graduation rates for all students overall. I then focused my analysis on understanding the demographic variables in the data and how those variables affected graduation rates. Did schools with higher graduation rates have different demographics than schools with lower graduation rates? I concluded that there was no one strong demographic variable that could account for differences in graduation rates. Rather, it was a combination of poverty rate, percent of minority populations, primarily English speaking populations, and other unidentified factors that were likely to influence graduation rates.
Since this EDA has not identified any single factor as a major influencer of graduation rates, future work will focus on digging into the data more deeply using machine learning algorithms and possibly adding more data to the analysis. The next step that will enrich this analysis is to perform a principal component analysis (PCA) on the variables I chose to work with. A PCA will allow me to identify the factors (variables) that most contribute to the variance in the dataset. I can also perform this analysis on the entire dataset of 580 variables, perhaps achieving additional insights into the data. After performing PCA, it would also be interesting to build a model of the data using machine learning algorithms to see if I could predict which demographics would result in higher graduation rates.
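As a starting point for that future work, a PCA can be run with `prcomp()` from base R. The sketch below uses a small simulated table with hypothetical column names; it has not been run on the hackathon data, and real data would need NA handling first.

```r
# Simulated demographic variables; names and relationships are hypothetical.
set.seed(5)
n <- 300
poverty <- rnorm(n, 15, 6)
demo <- data.frame(
  poverty_rate = poverty,
  unemployment = 0.4 * poverty + rnorm(n, 3, 2),
  pct_college  = 40 - 0.8 * poverty + rnorm(n, 0, 5),
  pct_over_65  = rnorm(n, 14, 4)
)

# Center and scale so variables on different scales contribute equally.
pca <- prcomp(demo, center = TRUE, scale. = TRUE)

# Share of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: how strongly each variable contributes to the first two components.
print(round(pca$rotation[, 1:2], 3))
```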
This was a challenging dataset to work with for several reasons. First, I had to narrow down the number of variables I looked at from 580 in the starting dataset. Using intuition and research into graduation rates, I narrowed my focus down to 83 variables of interest. Even then, I had to tidy the data, make new variables, and reshape some of the variables before using them in plotting and further analysis. There were times that I felt like I was drowning in data, and I had to remain focused on the main goal of the project. Even so, one of the main drawbacks of this analysis is that it is wide (I tried to look at many variables) instead of deep. Further analysis of the data using machine learning will help to rectify this shortcoming.
While I struggled through the analysis of the data, I also had some successes. I was able to determine which variables, in combination, explained some of the variance in graduation rates. I was surprised that race was an influencer of graduation rates, even when separated from underlying demographic variables. This suggests that there are cultural influences on graduation rates. It would be interesting to collaborate with a social scientist on this data analysis to see if additional insights could be gleaned about what these influences may be.
Overall, this was a fun, challenging project to work on, and I look forward to delving more deeply into the data.